Linear Classification: $f(x_i, W) = W \cdot x_i$

Matrix multiply: flatten the input $x_i$ into a one-dimensional column vector; $W$ is a weight matrix with one row per class, so $W \cdot x_i$ yields one score per class.
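A minimal sketch of the score computation in plain Python. The 2x2 "image", the three classes, and the weight values are all made up for illustration; only the shape of the computation (flatten, then one dot product per row of $W$) comes from the notes.

```python
def linear_scores(W, x):
    """Return class scores s = W . x, one dot product per row of W."""
    return [sum(w_k * x_k for w_k, x_k in zip(row, x)) for row in W]

# Hypothetical 2x2 input, flattened row by row into a 4-vector.
image = [[1.0, 2.0], [3.0, 4.0]]
x = [pixel for row in image for pixel in row]

# Hypothetical weights: 3 classes, so 3 rows of 4 weights each.
W = [
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, -1.0],
]

scores = linear_scores(W, x)
print(scores)  # one score per class
```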

Multiclass SVM Loss:

Let $s = f(x_i, W)$ be the scores; then the SVM loss has the form: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$

$s_{y_i}$ is the score of the correct label, while $s_j$ is the score of an incorrect label. When $s_j$ is larger than $s_{y_i} - 1$, that term contributes to the loss, so $L_i$ is greater than $0$.
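The loss for a single example can be sketched as below. The scores 3.2, 5.1, -1.7 for cat, car, frog are used as sample values, and treating cat as the correct class is an assumption made only for this example.

```python
def svm_loss(scores, correct, margin=1.0):
    """Multiclass SVM (hinge) loss for one example.

    scores  -- list of class scores s_j
    correct -- index y_i of the correct class
    """
    s_correct = scores[correct]
    return sum(
        max(0.0, s_j - s_correct + margin)
        for j, s_j in enumerate(scores)
        if j != correct  # the sum skips the correct class
    )

scores = [3.2, 5.1, -1.7]  # cat, car, frog
loss = svm_loss(scores, correct=0)  # assumption: cat is the true label
print(loss)
```

Here the car score exceeds $s_{y_i} - 1 = 2.2$, so it contributes $5.1 - 3.2 + 1 \approx 2.9$; the frog score is far below the margin and contributes nothing.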

Characteristics: 1. If $s_{y_i}$ already exceeds every incorrect score by more than 1, changing $s_{y_i}$ slightly does not change the loss, because after the change $s_{y_i}$ is still at least 1 greater than every incorrect score.

Minimum possible loss: $0$; maximum: $+\infty$

When all scores are small random values ($s_j \approx s_{y_i}$), each of the $C - 1$ terms in the sum is about $1$, so the loss is approximately $C - 1$, where $C$ is the number of categories.
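These properties are easy to check numerically. The sketch below uses made-up score vectors; the helper `svm_loss` is an illustrative name, not something from the notes.

```python
def svm_loss(scores, correct, margin=1.0):
    """Multiclass SVM loss for one example (sum over incorrect classes)."""
    s_correct = scores[correct]
    return sum(
        max(0.0, s_j - s_correct + margin)
        for j, s_j in enumerate(scores)
        if j != correct
    )

# All scores equal (the limit of "small random values"): loss is C - 1.
num_classes = 10
loss_at_init = svm_loss([0.0] * num_classes, correct=0)

# Correct score beats the others by more than the margin: loss is 0 ...
loss_before = svm_loss([5.0, 1.0, 2.0], correct=0)
# ... and stays 0 after a small change to the correct score.
loss_after = svm_loss([4.5, 1.0, 2.0], correct=0)

print(loss_at_init, loss_before, loss_after)
```

The first check is a common debugging aid: if the loss at the start of training is far from $C - 1$, something is likely wrong in the implementation.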

Regularization

$L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(f(x_i, W), y_i) + \lambda R(W)$

The most common regularization is the L2 norm: $R(W) = \sum_i \sum_j W_{i,j}^2$
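Putting the pieces together, the full objective might be sketched as follows. The choice of the SVM loss for $L_i$, the toy data `X`, `y`, the weights `W`, and the value of `lam` (standing in for $\lambda$) are all assumptions for illustration.

```python
def linear_scores(W, x):
    """Class scores s = W . x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def svm_loss(scores, correct, margin=1.0):
    """Per-example data loss L_i (multiclass SVM, chosen as an example)."""
    s_correct = scores[correct]
    return sum(
        max(0.0, s_j - s_correct + margin)
        for j, s_j in enumerate(scores)
        if j != correct
    )

def l2_penalty(W):
    """R(W): sum of squared entries of W."""
    return sum(w * w for row in W for w in row)

def total_loss(W, X, y, lam):
    """Mean data loss over the N examples plus lam * R(W)."""
    data_loss = sum(
        svm_loss(linear_scores(W, x_i), y_i) for x_i, y_i in zip(X, y)
    ) / len(X)
    return data_loss + lam * l2_penalty(W)

# Toy problem: 2 examples, 2 features, 3 classes (all values hypothetical).
X = [[1.0, 2.0], [3.0, -1.0]]
y = [0, 1]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

print(total_loss(W, X, y, lam=0.1))
```

Raising `lam` makes the penalty term dominate, which pushes the optimizer toward smaller weights at the cost of a higher data loss.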

Why do we need it?

  • It expresses preferences among models beyond “minimize training error”, letting people build in knowledge they already have.

  • Avoid overfitting

    Example: $x = [1,1,1,1]$, $w_1 = [1,0,0,0]$, $w_2 = [0.25,0.25,0.25,0.25]$

    Both weight vectors give the same score: $w_1^\mathrm{T} \cdot x = w_2^\mathrm{T} \cdot x = 1$

    L2 regularization prefers the more balanced weights, which is $w_2$ in this example: $R(w_1) = 1$, while $R(w_2) = 4 \times 0.25^2 = 0.25$. This preference makes the model use as many input features as possible instead of relying on a few, i.e. it “spreads out the weights”.

    Prefer simple models: by Occam’s razor, when two models explain the data equally well, the simpler one is preferred.
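The example above can be checked with a few lines of Python; the helper names `dot` and `l2_penalty` are illustrative.

```python
def dot(w, x):
    """Inner product w^T x."""
    return sum(w_k * x_k for w_k, x_k in zip(w, x))

def l2_penalty(w):
    """Sum of squared weights."""
    return sum(w_k * w_k for w_k in w)

x = [1.0, 1.0, 1.0, 1.0]
w1 = [1.0, 0.0, 0.0, 0.0]
w2 = [0.25, 0.25, 0.25, 0.25]

# Same score, so the data loss cannot tell the two apart ...
print(dot(w1, x), dot(w2, x))
# ... but the L2 penalty is four times smaller for the spread-out weights.
print(l2_penalty(w1), l2_penalty(w2))
```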

Cross Entropy Loss

SoftMax function:

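| class | unnormalized log-probability (logit) | unnormalized probability (exp) | probability (normalized) |
|-------|------|-------|------|
| cat   | 3.2  | 24.5  | 0.13 |
| car   | 5.1  | 164.0 | 0.87 |
| frog  | -1.7 | 0.18  | 0.00 |

unnormalized log-probabilities (logits) --exp--> unnormalized probabilities --normalize--> probabilities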

$L_i = -\ln P(Y = y_i \mid X = x_i)$ (maximum likelihood estimation: minimizing $L_i$ maximizes the probability assigned to the correct class)
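As a sketch, the softmax $P(Y = k \mid X = x_i) = e^{s_k} / \sum_j e^{s_j}$ and the resulting loss can be computed as below for the cat/car/frog scores. Subtracting the maximum score before exponentiating is a standard numerical-stability step added here (it leaves the probabilities unchanged), and treating cat as the correct class is an assumption for the example.

```python
import math

def softmax(scores):
    """Turn logits into probabilities: exponentiate, then normalize."""
    shift = max(scores)  # stability trick; does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(scores, correct):
    """L_i = -ln P(Y = y_i | X = x_i)."""
    return -math.log(softmax(scores)[correct])

scores = [3.2, 5.1, -1.7]  # cat, car, frog
probs = softmax(scores)
loss = cross_entropy_loss(scores, correct=0)  # assumption: cat is correct

print([round(p, 2) for p in probs])
print(loss)
```

With cat as the correct class the loss is $-\ln(0.13) \approx 2.04$; the classifier is penalized because it puts most of the probability on car.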

Minimum possible loss: $0$ (the loss can approach $0$ but never reach it, because the softmax probability of the correct class is always strictly less than $1$); maximum: $+\infty$

When all scores are small random values, every class gets a probability of about $1/C$, so the loss is approximately $-\ln(1/C) = \ln C$, where $C$ is the number of categories.
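This value is a handy sanity check at the start of training. A minimal sketch, assuming 10 classes and all-zero scores as the limiting case of "small random values":

```python
import math

def softmax(scores):
    """Exponentiate (after shifting by the max for stability), then normalize."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(scores, correct):
    """L_i = -ln P(Y = y_i | X = x_i)."""
    return -math.log(softmax(scores)[correct])

num_classes = 10  # hypothetical number of categories C
scores = [0.0] * num_classes  # identical scores -> uniform probabilities
loss = cross_entropy_loss(scores, correct=3)

print(loss, math.log(num_classes))  # the two values should agree
```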